Back in 2014, fivethiryeight.com published an article on alchohol consumption in different countries. The data drinks is available as part of the fivethirtyeight package. Make sure you have installed the fivethirtyeight package before proceeding.
library(fivethirtyeight)
data(drinks)
# or download directly
# alcohol_direct <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/alcohol-consumption/drinks.csv")What are the variable types? Any missing values we should worry about?
glimpse(drinks)## Rows: 193
## Columns: 5
## $ country <chr> "Afghanistan", "Albania", "Algeria", "An…
## $ beer_servings <int> 0, 89, 25, 245, 217, 102, 193, 21, 261, …
## $ spirit_servings <int> 0, 132, 0, 138, 57, 128, 25, 179, 72, 75…
## $ wine_servings <int> 0, 54, 14, 312, 45, 45, 221, 11, 212, 19…
## $ total_litres_of_pure_alcohol <dbl> 0.0, 4.9, 0.7, 12.4, 5.9, 4.9, 8.3, 3.8,…
skim(drinks)| Name | drinks |
| Number of rows | 193 |
| Number of columns | 5 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| country | 0 | 1 | 3 | 28 | 0 | 193 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| beer_servings | 0 | 1 | 106.16 | 101.14 | 0 | 20.0 | 76.0 | 188.0 | 376.0 | ▇▃▂▂▁ |
| spirit_servings | 0 | 1 | 80.99 | 88.28 | 0 | 4.0 | 56.0 | 128.0 | 438.0 | ▇▃▂▁▁ |
| wine_servings | 0 | 1 | 49.45 | 79.70 | 0 | 1.0 | 8.0 | 59.0 | 370.0 | ▇▁▁▁▁ |
| total_litres_of_pure_alcohol | 0 | 1 | 4.72 | 3.77 | 0 | 1.3 | 4.2 | 7.2 | 14.4 | ▇▃▅▃▁ |
There doent seem to be any missing values
beer_plot <- drinks %>%
arrange(desc(beer_servings)) %>%
select(country, beer_servings) %>%
head(25)
ggplot(beer_plot, aes(x= beer_servings, y = reorder(country, beer_servings))) +
geom_col() +
labs(
title = "Top 25 beer consuming countries",
x = "Servings",
y = "Countries"
)wine_plot <- drinks %>%
arrange(desc(wine_servings)) %>%
select(country, wine_servings) %>%
head(25)
ggplot(wine_plot, aes(x= wine_servings, y = reorder(country, wine_servings))) +
geom_col() +
labs(
title = "Top 25 wine consuming countries",
x = "Servings",
y = "Countries"
)spirit_plot <- drinks %>%
arrange(desc(spirit_servings)) %>%
select(country, spirit_servings) %>%
head(25)
ggplot(spirit_plot, aes(x= spirit_servings, y = reorder(country, spirit_servings))) +
geom_col() +
labs(
title = "Top 25 spirit consuming countries",
x = "Servings",
y = "Countries"
)What can you infer from these plots? Don’t just explain what’s in the graph, but speculate or tell a short story (1-2 paragraphs max).
TYPE YOUR ANSWER AFTER (AND OUTSIDE!) THIS BLOCKQUOTE.
The top consuming countries of one kind of beverage are generally not among the the top countries for a second beverage. This means most countries have one favourite beverage that brings them on top of the ranking. There also seems to be a regional and cultural division amongst the beverages:
We will look at a subset sample of movies, taken from the Kaggle IMDB 5000 movie dataset
movies <- read_csv(here::here("data", "movies.csv"))
glimpse(movies)## Rows: 2,961
## Columns: 11
## $ title <chr> "Avatar", "Titanic", "Jurassic World", "The Aveng…
## $ genre <chr> "Action", "Drama", "Action", "Action", "Action", …
## $ director <chr> "James Cameron", "James Cameron", "Colin Trevorro…
## $ year <dbl> 2009, 1997, 2015, 2012, 2008, 1999, 1977, 2015, 2…
## $ duration <dbl> 178, 194, 124, 173, 152, 136, 125, 141, 164, 93, …
## $ gross <dbl> 7.61e+08, 6.59e+08, 6.52e+08, 6.23e+08, 5.33e+08,…
## $ budget <dbl> 2.37e+08, 2.00e+08, 1.50e+08, 2.20e+08, 1.85e+08,…
## $ cast_facebook_likes <dbl> 4834, 45223, 8458, 87697, 57802, 37723, 13485, 92…
## $ votes <dbl> 886204, 793059, 418214, 995415, 1676169, 534658, …
## $ reviews <dbl> 3777, 2843, 1934, 2425, 5312, 3917, 1752, 1752, 3…
## $ rating <dbl> 7.9, 7.7, 7.0, 8.1, 9.0, 6.5, 8.7, 7.5, 8.5, 7.2,…
Besides the obvious variables of title, genre, director, year, and duration, the rest of the variables are as follows:
gross : The gross earnings in the US box office, not adjusted for inflationbudget: The movie’s budgetcast_facebook_likes: the number of facebook likes cast memebrs receivedvotes: the number of people who voted for (or rated) the movie in IMDBreviews: the number of reviews for that movierating: IMDB average ratingskim(movies)| Name | movies |
| Number of rows | 2961 |
| Number of columns | 11 |
| _______________________ | |
| Column type frequency: | |
| character | 3 |
| numeric | 8 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| title | 0 | 1 | 1 | 83 | 0 | 2907 | 0 |
| genre | 0 | 1 | 5 | 11 | 0 | 17 | 0 |
| director | 0 | 1 | 3 | 32 | 0 | 1366 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| year | 0 | 1 | 2.00e+03 | 9.95e+00 | 1920.0 | 2.00e+03 | 2.00e+03 | 2.01e+03 | 2.02e+03 | ▁▁▁▂▇ |
| duration | 0 | 1 | 1.10e+02 | 2.22e+01 | 37.0 | 9.50e+01 | 1.06e+02 | 1.19e+02 | 3.30e+02 | ▃▇▁▁▁ |
| gross | 0 | 1 | 5.81e+07 | 7.25e+07 | 703.0 | 1.23e+07 | 3.47e+07 | 7.56e+07 | 7.61e+08 | ▇▁▁▁▁ |
| budget | 0 | 1 | 4.06e+07 | 4.37e+07 | 218.0 | 1.10e+07 | 2.60e+07 | 5.50e+07 | 3.00e+08 | ▇▂▁▁▁ |
| cast_facebook_likes | 0 | 1 | 1.24e+04 | 2.05e+04 | 0.0 | 2.24e+03 | 4.60e+03 | 1.69e+04 | 6.57e+05 | ▇▁▁▁▁ |
| votes | 0 | 1 | 1.09e+05 | 1.58e+05 | 5.0 | 1.99e+04 | 5.57e+04 | 1.33e+05 | 1.69e+06 | ▇▁▁▁▁ |
| reviews | 0 | 1 | 5.03e+02 | 4.94e+02 | 2.0 | 1.99e+02 | 3.64e+02 | 6.31e+02 | 5.31e+03 | ▇▁▁▁▁ |
| rating | 0 | 1 | 6.39e+00 | 1.05e+00 | 1.6 | 5.80e+00 | 6.50e+00 | 7.10e+00 | 9.30e+00 | ▁▁▆▇▁ |
There are no missing values, there are however duplicates
movies %>%
count(genre) %>%
rename(count_movie = n) %>%
arrange(desc(count_movie))## # A tibble: 17 x 2
## genre count_movie
## <chr> <int>
## 1 Comedy 848
## 2 Action 738
## 3 Drama 498
## 4 Adventure 288
## 5 Crime 202
## 6 Biography 135
## 7 Horror 131
## 8 Animation 35
## 9 Fantasy 28
## 10 Documentary 25
## 11 Mystery 16
## 12 Sci-Fi 7
## 13 Family 3
## 14 Musical 2
## 15 Romance 2
## 16 Western 2
## 17 Thriller 1
gross and budget) by genre. Calculate a variable return_on_budget which shows how many $ did a movie make at the box office for each $ of its budget. Ranked genres by this return_on_budget in descending ordermovies %>%
mutate(return_on_budget = gross / budget) %>%
group_by(genre) %>%
summarise(avg_gross = mean(gross),
avg_budget = mean(budget),
avg_return_on_budget = mean(return_on_budget)) %>%
arrange(desc(avg_return_on_budget))## # A tibble: 17 x 4
## genre avg_gross avg_budget avg_return_on_budget
## <chr> <dbl> <dbl> <dbl>
## 1 Horror 37713738. 13504916. 88.3
## 2 Biography 45201805. 28543696. 22.3
## 3 Musical 92084000 3189500 18.8
## 4 Family 149160478. 14833333. 14.1
## 5 Documentary 17353973. 5887852. 8.70
## 6 Western 20821884 3465000 7.06
## 7 Fantasy 42408841. 17582143. 6.68
## 8 Animation 98433792. 61701429. 5.01
## 9 Comedy 42630552. 24446319. 3.71
## 10 Mystery 67533021. 39218750 3.27
## 11 Romance 31264848. 25107500 3.17
## 12 Drama 37465371. 26242933. 2.95
## 13 Adventure 95794257. 66290069. 2.41
## 14 Crime 37502397. 26596169. 2.17
## 15 Action 86583860. 71354888. 1.92
## 16 Sci-Fi 29788371. 27607143. 1.58
## 17 Thriller 2468 300000 0.00823
movies %>%
group_by(director) %>%
summarise(sum_gross_director = sum(gross),
mean_director = mean(gross),
median_director = median(gross),
std_director = sd(gross)) %>%
arrange(desc(sum_gross_director)) %>%
head(15)## # A tibble: 15 x 5
## director sum_gross_direct… mean_director median_director std_director
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Steven Spielberg 4014061704 174524422. 164435221 101421051.
## 2 Michael Bay 2231242537 171634041. 138396624 127161579.
## 3 Tim Burton 2071275480 129454718. 76519172 108726924.
## 4 Sam Raimi 2014600898 201460090. 234903076 162126632.
## 5 James Cameron 1909725910 318287652. 175562880. 309171337.
## 6 Christopher Nol… 1813227576 226653447 196667606. 187224133.
## 7 George Lucas 1741418480 348283696 380262555 146193880.
## 8 Robert Zemeckis 1619309108 124562239. 100853835 91300279.
## 9 Clint Eastwood 1378321100 72543216. 46700000 75487408.
## 10 Francis Lawrence 1358501971 271700394. 281666058 135437020.
## 11 Ron Howard 1335988092 111332341 101587923 81933761.
## 12 Gore Verbinski 1329600995 189942999. 123207194 154473822.
## 13 Andrew Adamson 1137446920 284361730 279680930. 120895765.
## 14 Shawn Levy 1129750988 102704635. 85463309 65484773.
## 15 Ridley Scott 1128857598 80632686. 47775715 68812285.
df_ratings <- movies %>%
group_by(genre) %>%
summarise(mean_rating = mean(rating),
median_rating = median(rating),
std_rating = sd(rating),
min_rating = min(rating),
max_rating = max(rating))
df_ratings## # A tibble: 17 x 6
## genre mean_rating median_rating std_rating min_rating max_rating
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Action 6.23 6.3 1.03 2.1 9
## 2 Adventure 6.51 6.6 1.09 2.3 8.6
## 3 Animation 6.65 6.9 0.968 4.5 8
## 4 Biography 7.11 7.2 0.760 4.5 8.9
## 5 Comedy 6.11 6.2 1.02 1.9 8.8
## 6 Crime 6.92 6.9 0.849 4.8 9.3
## 7 Documentary 6.66 7.4 1.77 1.6 8.5
## 8 Drama 6.73 6.8 0.917 2.1 8.8
## 9 Family 6.5 5.9 1.22 5.7 7.9
## 10 Fantasy 6.15 6.45 0.959 4.3 7.9
## 11 Horror 5.83 5.9 1.01 3.6 8.5
## 12 Musical 6.75 6.75 0.636 6.3 7.2
## 13 Mystery 6.86 6.9 0.882 4.6 8.5
## 14 Romance 6.65 6.65 0.636 6.2 7.1
## 15 Sci-Fi 6.66 6.4 1.09 5 8.2
## 16 Thriller 4.8 4.8 NA 4.8 4.8
## 17 Western 5.70 5.70 2.26 4.1 7.3
ggplot(df_ratings, aes(x = mean_rating)) +
geom_histogram(color = "black", fill = "blue") +
labs(
title = "Distribution of average genre rating",
x = "Avg Rating",
y = "Frequency")ggplot to answer the followinggross and cast_facebook_likes. Produce a scatterplot and write one sentence discussing whether the number of facebook likes that the cast has received is likely to be a good predictor of how much money a movie will make at the box office. What variable are you going to map to the Y- and X- axes?ggplot(movies, aes(x= gross, y= cast_facebook_likes)) +
geom_point() +
labs(
title = "Scatterplot of gross earnings on cast's facebook likes",
x = "Gross Earnings",
y = "Cast's Facebook Likes") Facebook likes will most likely not be a good predictor for box office performance. One reason for this could be that the cast (if new or unknown) would start getting facebook likes after a successful movie
Examine the relationship between gross and budget. Produce a scatterplot and write one sentence discussing whether budget is likely to be a good predictor of how much money a movie will make at the box office.
ggplot(movies, aes(x= gross, y= budget)) +
geom_point() +
labs(
title = "Scatterplot of gross earnings on budget",
x = "Gross Earnings",
y = "Budget") A movies budget could have the potential to be a good predictor for movie success, as only good screenplays would be allowed to spend a lot on actually filming and marketing the movie.
gross and rating. Produce a scatterplot, faceted by genre and discuss whether IMDB ratings are likely to be a good predictor of how much money a movie will make at the box office. Is there anything strange in this dataset?ggplot(movies, aes(x= rating, y= gross)) +
geom_point() +
facet_wrap(vars(genre)) +
labs(
title = "Scatterplots of gross earnings on ratings by genre",
y = "Gross Earnings",
x = "IMDB Rating") IMDB ratings do seem to generally be a good predictor of the money a movie will make. Especially within the genres where we have a lot of data we can see a positive correlation. There are some stranger genres where this doesn’t hold to be true however. Documentaries and Fantasy movies don’t show a correlation, so their generated revenues don’t seem to be related to ratings.
You may find useful the material on finance data sources.
We will use the tidyquant package to download historical data of stock prices, calculate returns, and examine the distribution of returns.
We must first identify which stocks we want to download data for, and for this we must know their ticker symbol; Apple is known as AAPL, Microsoft as MSFT, McDonald’s as MCD, etc. The file nyse.csv contains 508 stocks listed on the NYSE, their ticker symbol, name, the IPO (Initial Public Offering) year, and the sector and industry the company is in.
nyse <- read_csv(here::here("data","nyse.csv"))Based on this dataset, create a table and a bar plot that shows the number of companies per sector, in descending order
companies_per_sector <-
nyse %>%
count(sector) %>%
rename(count_company = n) %>%
arrange(desc(count_company))
companies_per_sector## # A tibble: 12 x 2
## sector count_company
## <chr> <int>
## 1 Finance 97
## 2 Consumer Services 79
## 3 Public Utilities 60
## 4 Capital Goods 45
## 5 Health Care 45
## 6 Energy 42
## 7 Technology 40
## 8 Basic Industries 39
## 9 Consumer Non-Durables 31
## 10 Miscellaneous 12
## 11 Transportation 10
## 12 Consumer Durables 8
ggplot(companies_per_sector, aes(y = reorder(sector, count_company), x = count_company))+
geom_bar(stat = "identity" ) +
labs(
title = "Companies per Sector on NYSE - Bar Plot",
x = "Sector",
y = "# of companies on NYSE"
)Next, let’s choose the Dow Jones Industrial Aveareg (DJIA) stocks and their ticker symbols and download some data. Besides the thirty stocks that make up the DJIA, we will also add SPY which is an SP500 ETF (Exchange Traded Fund).
djia_url <- "https://en.wikipedia.org/wiki/Dow_Jones_Industrial_Average"
#get tables that exist on URL
tables <- djia_url %>%
read_html() %>%
html_nodes(css="table")
# parse HTML tables into a dataframe called djia.
# Use purr::map() to create a list of all tables in URL
djia <- map(tables, . %>%
html_table(fill=TRUE)%>%
clean_names())
# constituents
table1 <- djia[[2]] %>% # the second table on the page contains the ticker symbols
mutate(date_added = ymd(date_added),
# if a stock is listed on NYSE, its symbol is, e.g., NYSE: MMM
# We will get prices from yahoo finance which requires just the ticker
# if symbol contains "NYSE*", the * being a wildcard
# then we jsut drop the first 6 characters in that string
ticker = ifelse(str_detect(symbol, "NYSE*"),
str_sub(symbol,7,11),
symbol)
)
# we need a vector of strings with just the 30 tickers + SPY
tickers <- table1 %>%
select(ticker) %>%
pull() %>% # pull() gets them as a sting of characters
c("SPY") # and lets us add SPY, the SP500 ETF# Notice the cache=TRUE argument in the chunk options. Because getting data is time consuming, # cache=TRUE means that once it downloads data, the chunk will not run again next time you knit your Rmd
myStocks <- tickers %>%
tq_get(get = "stock.prices",
from = "2000-01-01",
to = "2020-08-31") %>%
group_by(symbol)
glimpse(myStocks) # examine the structure of the resulting data frame## Rows: 153,121
## Columns: 8
## Groups: symbol [31]
## $ symbol <chr> "MMM", "MMM", "MMM", "MMM", "MMM", "MMM", "MMM", "MMM", "MMM…
## $ date <date> 2000-01-03, 2000-01-04, 2000-01-05, 2000-01-06, 2000-01-07,…
## $ open <dbl> 48.0, 46.4, 45.6, 47.2, 50.6, 50.2, 50.4, 51.0, 50.7, 50.4, …
## $ high <dbl> 48.2, 47.4, 48.1, 51.2, 51.9, 51.8, 51.2, 51.8, 50.9, 50.5, …
## $ low <dbl> 47.0, 45.3, 45.6, 47.2, 50.0, 50.0, 50.2, 50.4, 50.2, 49.5, …
## $ close <dbl> 47.2, 45.3, 46.6, 50.4, 51.4, 51.1, 50.2, 50.4, 50.4, 49.7, …
## $ volume <dbl> 2173400, 2713800, 3699400, 5975800, 4101200, 3863800, 235760…
## $ adjusted <dbl> 28.1, 26.9, 27.7, 30.0, 30.5, 30.4, 29.9, 30.0, 30.0, 29.5, …
Financial performance analysis depend on returns; If I buy a stock today for 100 and I sell it tomorrow for 101.75, my one-day return, assuming no transaction costs, is 1.75%. So given the adjusted closing prices, our first step is to calculate daily and monthly returns.
#calculate daily returns
myStocks_returns_daily <- myStocks %>%
tq_transmute(select = adjusted,
mutate_fun = periodReturn,
period = "daily",
type = "log",
col_rename = "daily_returns",
cols = c(nested.col))
#calculate monthly returns
myStocks_returns_monthly <- myStocks %>%
tq_transmute(select = adjusted,
mutate_fun = periodReturn,
period = "monthly",
type = "arithmetic",
col_rename = "monthly_returns",
cols = c(nested.col))
#calculate yearly returns
myStocks_returns_annual <- myStocks %>%
group_by(symbol) %>%
tq_transmute(select = adjusted,
mutate_fun = periodReturn,
period = "yearly",
type = "arithmetic",
col_rename = "yearly_returns",
cols = c(nested.col))Create a dataframe and assign it to a new object, where you summarise monthly returns since 2017-01-01 for each of the stocks and SPY; min, max, median, mean, SD.
monthly_returns_17_to_20 <-
myStocks_returns_monthly %>%
filter(date >= as.Date("2017-01-01")) %>%
group_by(symbol) %>%
summarise(mean_monthly_returns = mean(monthly_returns),
median_monthly_returns = median(monthly_returns),
std_monthly_returns = sd(monthly_returns),
min_monthly_returns = min(monthly_returns),
max_monthly_returns = max(monthly_returns)) %>%
arrange(desc(mean_monthly_returns)) # Arrange for mean return to have a better overview
monthly_returns_17_to_20 # Print table## # A tibble: 31 x 6
## symbol mean_monthly_re… median_monthly_… std_monthly_ret… min_monthly_ret…
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 AAPL 0.0387 0.0513 0.0873 -0.181
## 2 CRM 0.0350 0.0403 0.0850 -0.155
## 3 MSFT 0.0327 0.0288 0.0503 -0.0840
## 4 V 0.0253 0.0281 0.0520 -0.114
## 5 NKE 0.0213 0.0271 0.0672 -0.119
## 6 HD 0.0213 0.0252 0.0626 -0.151
## 7 WMT 0.0196 0.0257 0.0535 -0.156
## 8 UNH 0.0186 0.0211 0.0637 -0.115
## 9 AMGN 0.0171 0.0235 0.0664 -0.104
## 10 MCD 0.0164 0.0157 0.0534 -0.148
## # … with 21 more rows, and 1 more variable: max_monthly_returns <dbl>
Plot a density plot, using geom_density(), for each of the stocks
ggplot(myStocks_returns_monthly, aes(monthly_returns)) +
geom_density() +
geom_vline(aes(xintercept=mean(monthly_returns)), # Add a mean line for clarity
color="red", linetype="dashed", size=0.5) +
facet_wrap(vars(symbol)) +
labs(
title = "Monthly Return Distribution by Stock",
x = "Monthly Returns",
y = "Density"
)What can you infer from this plot? Which stock is the riskiest? The least risky?
TYPE YOUR ANSWER AFTER (AND OUTSIDE!) THIS BLOCKQUOTE.
The density plots show us two things in general:
These two points can be then be used to assess the overall performance of the stock over the past years and the risk associated with them.
Looking only at the plots we can see that DOW and AAPL are very widely spread and are therefore the riskiest stocks. On the other side of the spectrum we have the SPY ETF, which is to be expected as its an index fund that is well diversified.
Finally, produce a plot that shows the expected monthly return (mean) of a stock on the Y axis and the risk (standard deviation) in the X-axis. Please use ggrepel::geom_text_repel() to label each stock with its ticker symbol
monthly_returns_17_to_20## # A tibble: 31 x 6
## symbol mean_monthly_re… median_monthly_… std_monthly_ret… min_monthly_ret…
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 AAPL 0.0387 0.0513 0.0873 -0.181
## 2 CRM 0.0350 0.0403 0.0850 -0.155
## 3 MSFT 0.0327 0.0288 0.0503 -0.0840
## 4 V 0.0253 0.0281 0.0520 -0.114
## 5 NKE 0.0213 0.0271 0.0672 -0.119
## 6 HD 0.0213 0.0252 0.0626 -0.151
## 7 WMT 0.0196 0.0257 0.0535 -0.156
## 8 UNH 0.0186 0.0211 0.0637 -0.115
## 9 AMGN 0.0171 0.0235 0.0664 -0.104
## 10 MCD 0.0164 0.0157 0.0534 -0.148
## # … with 21 more rows, and 1 more variable: max_monthly_returns <dbl>
ggplot(monthly_returns_17_to_20, aes(x = std_monthly_returns, y = mean_monthly_returns)) +
geom_point() +
geom_text_repel(aes(label = symbol),
size = 2.5) +
geom_smooth(method = "lm") + # add in regression line to see under and over performing stocks in terms of risk vs return
expand_limits(x = 0, y = 0) + # set x and y axis to begin at 0 for clarity
labs(
title = "Risk vs Return Plot - DJIA",
x = "Monthly Return Standard Deviation",
y = "Mean Monthly Return"
)What can you infer from this plot? Are there any stocks which, while being riskier, do not have a higher expected return?
TYPE YOUR ANSWER AFTER (AND OUTSIDE!) THIS BLOCKQUOTE.
In finance there is a general expectation of returns being related to risk. In that sense we would be expecting to see a positive correlation between the stocks mean returns and return standard deviation. This tendency can not be seen in our data set from 2017 to 2020, however we did find the said positive correlation in the entire data set from 07 to 20.
We added a regression line onto our plot to see the average relationship between risk and return in our sample. This helps us to decide which stock can generally be seen to be too risky whilst not providing enough returns or which stocks seem to have performed very well, producing high returns with less risk involved. The line seperates these two categories:
Notably strong stocks (being less risky than would have been expected for the achieved returns) are:
Notably weak stocks (being more risky than would have been expected for the achieved returns) are:
If one had to choose only one stock with the above information it would be one of Apple, Salesforce or Microsoft which outperform the remaining sample in terms of achieved returns for taken risk.
We will analyse a data set on Human Resoruce Analytics. The IBM HR Analytics Employee Attrition & Performance data set is a fictional data set created by IBM data scientists. Among other things, the data set includes employees’ income, their distance from work, their position in the company, their level of education, etc. A full description can be found on the website.
hr_dataset <- read_csv(here::here("data", "datasets_1067_1925_WA_Fn-UseC_-HR-Employee-Attrition.csv"))
glimpse(hr_dataset)## Rows: 1,470
## Columns: 35
## $ Age <dbl> 41, 49, 37, 33, 27, 32, 59, 30, 38, 36, 35, …
## $ Attrition <chr> "Yes", "No", "Yes", "No", "No", "No", "No", …
## $ BusinessTravel <chr> "Travel_Rarely", "Travel_Frequently", "Trave…
## $ DailyRate <dbl> 1102, 279, 1373, 1392, 591, 1005, 1324, 1358…
## $ Department <chr> "Sales", "Research & Development", "Research…
## $ DistanceFromHome <dbl> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, 15, 26,…
## $ Education <dbl> 2, 1, 2, 4, 1, 2, 3, 1, 3, 3, 3, 2, 1, 2, 3,…
## $ EducationField <chr> "Life Sciences", "Life Sciences", "Other", "…
## $ EmployeeCount <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ EmployeeNumber <dbl> 1, 2, 4, 5, 7, 8, 10, 11, 12, 13, 14, 15, 16…
## $ EnvironmentSatisfaction <dbl> 2, 3, 4, 4, 1, 4, 3, 4, 4, 3, 1, 4, 1, 2, 3,…
## $ Gender <chr> "Female", "Male", "Male", "Female", "Male", …
## $ HourlyRate <dbl> 94, 61, 92, 56, 40, 79, 81, 67, 44, 94, 84, …
## $ JobInvolvement <dbl> 3, 2, 2, 3, 3, 3, 4, 3, 2, 3, 4, 2, 3, 3, 2,…
## $ JobLevel <dbl> 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1, 1, 1,…
## $ JobRole <chr> "Sales Executive", "Research Scientist", "La…
## $ JobSatisfaction <dbl> 4, 2, 3, 3, 2, 4, 1, 3, 3, 3, 2, 3, 3, 4, 3,…
## $ MaritalStatus <chr> "Single", "Married", "Single", "Married", "M…
## $ MonthlyIncome <dbl> 5993, 5130, 2090, 2909, 3468, 3068, 2670, 26…
## $ MonthlyRate <dbl> 19479, 24907, 2396, 23159, 16632, 11864, 996…
## $ NumCompaniesWorked <dbl> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1, 0, 5,…
## $ Over18 <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y",…
## $ OverTime <chr> "Yes", "No", "Yes", "Yes", "No", "No", "Yes"…
## $ PercentSalaryHike <dbl> 11, 23, 15, 11, 12, 13, 20, 22, 21, 13, 13, …
## $ PerformanceRating <dbl> 3, 4, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3, 3,…
## $ RelationshipSatisfaction <dbl> 1, 4, 2, 3, 4, 3, 1, 2, 2, 2, 3, 4, 4, 3, 2,…
## $ StandardHours <dbl> 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, …
## $ StockOptionLevel <dbl> 0, 1, 0, 0, 1, 0, 3, 1, 0, 2, 1, 0, 1, 1, 0,…
## $ TotalWorkingYears <dbl> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, 10, 5, …
## $ TrainingTimesLastYear <dbl> 0, 3, 3, 3, 3, 2, 3, 2, 2, 3, 5, 3, 1, 2, 4,…
## $ WorkLifeBalance <dbl> 1, 3, 3, 3, 3, 2, 2, 3, 3, 2, 3, 3, 2, 3, 3,…
## $ YearsAtCompany <dbl> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5, 9, 5, 2, 4…
## $ YearsInCurrentRole <dbl> 4, 7, 0, 7, 2, 7, 0, 0, 7, 7, 4, 5, 2, 2, 2,…
## $ YearsSinceLastPromotion <dbl> 0, 1, 0, 3, 2, 3, 0, 0, 1, 7, 0, 0, 4, 1, 0,…
## $ YearsWithCurrManager <dbl> 5, 7, 0, 0, 2, 6, 0, 0, 8, 7, 3, 8, 3, 2, 3,…
I am going to clean the data set, as variable names are in capital letters, some variables are not really necessary, and some variables, e.g., education are given as a number rather than a more useful description
hr_cleaned <- hr_dataset %>%
clean_names() %>%
mutate(
education = case_when(
education == 1 ~ "Below College",
education == 2 ~ "College",
education == 3 ~ "Bachelor",
education == 4 ~ "Master",
education == 5 ~ "Doctor"
),
environment_satisfaction = case_when(
environment_satisfaction == 1 ~ "Low",
environment_satisfaction == 2 ~ "Medium",
environment_satisfaction == 3 ~ "High",
environment_satisfaction == 4 ~ "Very High"
),
job_satisfaction = case_when(
job_satisfaction == 1 ~ "Low",
job_satisfaction == 2 ~ "Medium",
job_satisfaction == 3 ~ "High",
job_satisfaction == 4 ~ "Very High"
),
performance_rating = case_when(
performance_rating == 1 ~ "Low",
performance_rating == 2 ~ "Good",
performance_rating == 3 ~ "Excellent",
performance_rating == 4 ~ "Outstanding"
),
work_life_balance = case_when(
work_life_balance == 1 ~ "Bad",
work_life_balance == 2 ~ "Good",
work_life_balance == 3 ~ "Better",
work_life_balance == 4 ~ "Best"
)
) %>%
select(age, attrition, daily_rate, department,
distance_from_home, education,
gender, job_role,environment_satisfaction,
job_satisfaction, marital_status,
monthly_income, num_companies_worked, percent_salary_hike,
performance_rating, total_working_years,
work_life_balance, years_at_company,
years_since_last_promotion)attrition)hr_cleaned %>%
filter(attrition == "Yes") %>%
summarise(mean_until_attrition = mean(years_at_company, na.rm = TRUE))## # A tibble: 1 x 1
## mean_until_attrition
## <dbl>
## 1 5.13
Employees of the company leave the firm after 5.13 years on average.
age, years_at_company, monthly_income and years_since_last_promotion distributed?summarise_age <-
hr_cleaned %>%
summarise(stat = "Age",
mean_stat = mean(age),
median_stat= median(age),
std_stat = sd(age),
min_stat = min(age),
max_stat = max(age))
summarise_years <-
hr_cleaned %>%
summarise(stat = "Years at Company",
mean_stat = mean(years_at_company),
median_stat= median(years_at_company),
std_stat = sd(years_at_company),
min_stat = min(years_at_company),
max_stat = max(years_at_company))
summarise_income <-
hr_cleaned %>%
summarise(stat = "Monthly Income",
mean_stat = mean(monthly_income),
median_stat= median(monthly_income),
std_stat = sd(monthly_income),
min_stat = min(monthly_income),
max_stat = max(monthly_income))
summarise_promotion <-
hr_cleaned %>%
summarise(stat = "Years since last Promotion",
mean_stat = mean(years_since_last_promotion),
median_stat= median(years_since_last_promotion),
std_stat = sd(years_since_last_promotion),
min_stat = min(years_since_last_promotion),
max_stat = max(years_since_last_promotion))
all_stats_summarise <- bind_rows(summarise_age,
summarise_years,
summarise_income,
summarise_promotion)
all_stats_summarise## # A tibble: 4 x 6
## stat mean_stat median_stat std_stat min_stat max_stat
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Age 36.9 36 9.14 18 60
## 2 Years at Company 7.01 5 6.13 0 40
## 3 Monthly Income 6503. 4919 4708. 1009 19999
## 4 Years since last Promotion 2.19 1 3.22 0 15
From our analysis we can infer the approximate distribution of our four variables age, years_at_company, monthly_income and years_since_last_promotion. A normal distribution can among other things be characterised by its mean and median being close to identical and both lying in the middle of the range between the maximum and minimum value. We can therefore summarise the following findings:
job_satisfaction and work_life_balance distributed?distr_satisf <- hr_cleaned %>%
select(job_satisfaction) %>%
count(job_satisfaction) %>%
mutate(freq = n / sum(n)) %>%
arrange(match(job_satisfaction, c("Very High","High","Medium","Low"))) %>%
rename(count = n)
distr_balance <- hr_cleaned %>%
select(work_life_balance) %>%
count(work_life_balance) %>%
mutate(freq = n / sum(n)) %>%
arrange(match(work_life_balance,c("Best","Better","Good","Bad"))) %>%
rename(count = n)
distr_satisf## # A tibble: 4 x 3
## job_satisfaction count freq
## <chr> <int> <dbl>
## 1 Very High 459 0.312
## 2 High 442 0.301
## 3 Medium 280 0.190
## 4 Low 289 0.197
distr_balance## # A tibble: 4 x 3
## work_life_balance count freq
## <chr> <int> <dbl>
## 1 Best 153 0.104
## 2 Better 893 0.607
## 3 Good 344 0.234
## 4 Bad 80 0.0544
We can see that the job satisfaction and work life balance are not distributed similarly. Both have the majority of employees in the most positive categories, however for work life balance this share is slightly higher at around 70% compared to roughly 60% among job satisfaction. There is also a substantially higher share of employees who are dissatisfied with their job (20%) whereas only around 5% claim to have a bad work life balance. Whereas it is often discussed that work life balance is a main driver of job satisfaction in businesses, this indicates that it is not be the only driver.
ggplot(hr_cleaned, aes(x = reorder(education, monthly_income, FUN = mean), y = monthly_income)) + # Boxplots are ordered by mean monthly income. This provides additional information about the data as median income and inter-quartile ranges are already graphically demonstrated.
geom_boxplot() +
labs(
title = "Income Distribution by Education Level",
x = "Education Level",
y = "Monthly Income"
)ggplot(hr_cleaned, aes(x = reorder(gender, monthly_income, FUN = mean), y = monthly_income)) +
geom_boxplot() +
labs(
title = "Income Distribution by Gender",
x = "Gender",
y = "Monthly Income"
)Overall, there seems to be a positive correlation between income and education level. This suggests that the higher the level of education, the higher the monthly income on average. We can also notice that the spread of income levels within the different educational levels increases with increased education. This could be interpreted so that a higher level of education opens up higher salary levels which might have been impossible to achive with a lower level of education, however that it doesnt guarantee these higher levels.
Meanwhile, there also seems to be a relationship between income and gender. Males have a lower mean, median, 25-p & 75-p than females, this difference is however minimal. Interestingly, males do also have a higher number of outliers on the top end of the spectrum.
ggplot(hr_cleaned, aes(y = reorder(job_role, monthly_income, FUN = mean), x = monthly_income)) +
geom_boxplot() +
labs(
title = "Income Distribution by Job Role",
y = "Role",
x = "Monthly Income"
) The above highlights which roles are being paid most. They also show that there is a tendency for higher paid roles to have a more spread salary distribution.
hr_cleaned %>%
group_by(education) %>%
summarise(mean_income = mean(monthly_income)) %>%
ggplot(aes(x = reorder(education, mean_income, FUN = mean), y = mean_income)) +
geom_col() +
labs(
title = "Mean Income by Education Level",
y = "Mean Income",
x = "Education Level")Due to our previously conducted analysis we have reason to believe that education level has a bigger effect on overall salary range rather than actual salary. Plotting the median income would not show us these outliers that highlight career opportunities at the top end of the spectrum so we decided to plot the mean salary by education. We can see that this increases with educational level as expected.
hr_cleaned %>%
ggplot(aes(y = reorder(education, monthly_income, FUN = mean), x = monthly_income)) +
geom_boxplot() +
facet_wrap(vars(performance_rating)) +
theme_economist() +
scale_fill_economist() +
labs(
title = "Income vs Education - Split by performance rating",
x = "Monthly Income",
y = "Education Level"
)Again we can see our previously discovered tendency for education to allow for a greater range of salaries. In the above boxplots we can again see that a better performance (“Outstanding”) tends to lead to a greater salary spread and slightly higher mean salary for better educated employees. This does not hold perfectly true but there is a general tendency from below college ducation to having a doctor. This indicates that good work especially pays off when you have the right educational level to step up in the company.
job_role# find job role with the highest mean salary
prof_role <- hr_cleaned %>%
group_by(job_role) %>%
summarise(mean_income = mean(monthly_income)) %>%
arrange(mean_income)
# convert list of best paid job roles into a vector
order_role <- pull(prof_role, job_role)
# set factor of job role to mean monthly income so that upcoming facet wrap is ordered correctly
inc_age_by_role <- hr_cleaned %>%
select(monthly_income, age, job_role) %>%
mutate(job_role = fct_reorder(job_role, monthly_income, .fun = mean))
ggplot(inc_age_by_role, aes(x = monthly_income, y = age)) +
geom_point() +
facet_wrap(vars(job_role)) +
theme_economist() +
labs(
title = "Income vs Age - Split by role",
x = "Monthly Income",
y = "Age"
)We can see that the tendency of age relating to monthly income increases with higher earning job roles. In low paying roles such as Sales Representative the age of employee does not seem to be driving the monthly income they receive. On the other end of the spectrum, the Manager roles, we can see that there is a tendency of older employees to earn more.
The purpose of this exercise is to make a publication-ready plot using your dplyr and ggplot2 skills. Open the journal article “Riddell_Annals_Hom-Sui-Disparities.pdf”. Read the abstract and have a look at Figure 3. The data you need is “CDC_Males.csv”.
Don’t worry about replicating it exactly, try and see how far you can get. You’re encouraged to work together if you want to and exchange tips/tricks you figured out.
You may find these helpful:
# Read csv file and import into dataframe
cdc_males <- read_csv("data/CDC_Males.csv")
# Calculate Spearman correlation coefficient
correl_cdc <- cdc_males %>%
filter(type.fac == "Firearm-related") %>%
drop_na(adjusted.suicide.White, adjusted.homicide.White) %>%
summarise(correl = cor(adjusted.suicide.White, adjusted.homicide.White, method = "spearman"))
# Create Scatterplot
cdc_males[!is.na(cdc_males$gun.house.prev.category),] %>% # get ris of all NAs in gun.house.prev.category
filter(type.fac == "Firearm-related") %>%
ggplot(aes(x = adjusted.suicide.White, y = adjusted.homicide.White, colour = gun.house.prev.category, rm.na = TRUE)) +
geom_point(aes(size = average.pop.white)) + # Make datapoint size dependent on average white population size
scale_size_area(breaks = c(500000, 1500000, 3000000, 7000000), # Specify category breaks for datapoint size
labels = c('500k', '1.5m', '3m', '7m'), max_size = 14) +
scale_color_manual(values = c("10.2%-19.9%" = "#fef0d9", # specify colours of range buckets
"20.0%-34.9%" = "#fdcc8a",
"35.0%-44.9%" = "#fc8d59",
"45.0%-65.5%" = "#d7301f")) +
geom_text_repel(aes(label = ST), # Add state labels to datapoints
size = 5,
color = "Black") +
annotate(x = 25, y = 0.5, # Add correlation to graph
label = paste("Spearman Rho: ", round(correl_cdc, 2)),
geom="text", size=5) +
theme_light() +
labs(
title = "Annual rates of firearm homicide and suicide among white men, by state, household firearm ownership",
x = "White Suicide Rate (per 100,000 per Year)",
y = "White Homicide Rate (per 100,000 per Year)",
size = "White Population",
colour = "Gun Ownership"
)As discussed in class, I would like you to reproduce the plot that shows the top ten cities in highest amounts raised in political contributions in California during the 2016 US Presidential election.
To get this plot, you must join two dataframes; the one you have with all contributions, and data that can translate zipcodes to cities. You can find a file with all US zipcodes, e.g., here http://www.uszipcodelist.com/download.html.
The easiest way would be to create two plots and then place one next to each other. For this, you will need the patchwork package. https://cran.r-project.org/web/packages/patchwork/index.html
While this is ok, what if one asked you to create the same plot for the top 10 candidates and not just the top two? The most challenging part is how to reorder within categories, and for this you will find Julia Silge’s post on REORDERING AND FACETTING FOR GGPLOT2 useful.
# Make sure you use vroom() as it is significantly faster than read.csv()
CA_contributors_2016 <- vroom::vroom(here::here("data","CA_contributors_2016.csv"))
zipcode <- vroom("http://www.uszipcodelist.com/zip_code_database.csv")
library(tidytext)
cleaned_state_data <- zipcode %>%
select(zip, primary_city) %>%
mutate_at("zip", as.numeric) # Make sure zip are numeric values
Contribution_plot <- left_join(CA_contributors_2016, cleaned_state_data, by = c("zip" = "zip")) %>%
select(cand_nm, contb_receipt_amt, zip, primary_city) %>%
filter(cand_nm %in% c("Trump, Donald J.", "Clinton, Hillary Rodham")) %>% # filter for zip codes where DT OR HC received contributions
group_by(primary_city, cand_nm) %>%
summarise(total_in_primary_city = sum(contb_receipt_amt)) %>%
arrange(desc(total_in_primary_city)) %>%
ungroup %>%
group_by(cand_nm) %>%
slice_head(n = 10)
# Create column plot where we reorder cities for both candidates according to highest total amounts with reorder_within()
ggplot(Contribution_plot, aes(x = total_in_primary_city, y = reorder_within(primary_city, total_in_primary_city, cand_nm), fill = cand_nm)) +
geom_col(show.legend = FALSE) +
facet_wrap(~cand_nm, scales = "free") +
scale_y_reordered() + # clean the y-axis
scale_x_continuous(labels=scales::dollar_format()) + # clean the x-axis
labs(
title = "Where did candidates raise most money?",
x = "Amount raised",
y = element_blank()
) +
theme(panel.border = element_rect(colour = "black", fill=NA, size=1),
strip.background = element_rect(color = "black", size = 1),
panel.background = element_rect(fill = "white"),
panel.grid.minor = element_line(colour = "grey", size = 0.5),
panel.grid.major = element_line(colour = "grey", size = 0.5)) +
scale_fill_manual(values = c("Clinton, Hillary Rodham" = "#1C49C3",
"Trump, Donald J." = "#D80D22"))There is a lot of explanatory text, comments, etc. You do not need these, so delete them and produce a stand-alone document that you could share with someone. Knit the edited and completed R Markdown file as an HTML document (use the “Knit” button at the top of the script editor window) and upload it to Canvas.
Please seek out help when you need it, and remember the 15-minute rule. You know enough R (and have enough examples of code from class and your readings) to be able to do this. If you get stuck, ask for help from others, post a question on Slack– and remember that I am here to help too!
As a true test to yourself, do you understand the code you submitted and are you able to explain it to someone else?
Check minus (1/5): Displays minimal effort. Doesn’t complete all components. Code is poorly written and not documented. Uses the same type of plot for each graph, or doesn’t use plots appropriate for the variables being analyzed.
Check (3/5): Solid effort. Hits all the elements. No clear mistakes. Easy to follow (both the code and the output).
Check plus (5/5): Finished all components of the assignment correctly and addressed both challenges. Code is well-documented (both self-documented and with additional comments as necessary). Used tidyverse, instead of base R. Graphs and tables are properly labelled. Analysis is clear and easy to follow, either because graphs are labeled clearly or you’ve written additional text to describe how you interpret the output.